Communication-minimal partitioning and data alignment of BLAS-like algorithms
Authors
Abstract
Automatic data alignment and computation domain partitioning techniques have been widely investigated to reduce communication overheads in distributed memory systems. This paper considers how these two techniques can be combined and applied to BLAS-like algorithms. Current data alignment techniques focus on individual data entries and cannot be directly used for cases where blocks of entries should be aligned collectively. This paper shows that existing data alignment techniques can be applied to partitioned algorithms if the null space of the data array indexing matrix is a boundary of computation blocks or the intersection of some of the computation block boundaries. These conditions can be used to generate several different partitionings and time-space transformations from which the optimal ones can be chosen for a given target architecture. An example illustrates how it is possible to trade off the number of communications against memory space. Another example shows partitions of matrix-matrix multiplication that have smaller communication-computation ratios than Cannon's algorithm does.
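The "null space of the data array indexing matrix" mentioned in the abstract can be made concrete with a small sketch. The example below assumes the standard matrix-multiplication loop nest C[i][j] += A[i][k] * B[k][j] with iteration vector (i, j, k); the names `null_space` and `INDEX` are hypothetical and are not taken from the paper. It only computes the null-space directions, that is, the iteration directions along which the same array element is reused. It does not implement the paper's partitioning or alignment procedure.

```python
from fractions import Fraction


def null_space(matrix):
    """Return a basis for the null space of a small integer matrix.

    Uses exact rational Gauss-Jordan elimination, so there is no rounding.
    """
    rows = [[Fraction(x) for x in row] for row in matrix]
    n_cols = len(rows[0])
    pivot_cols = []
    r = 0  # index of the next pivot row
    for c in range(n_cols):
        # Look for a row at or below r with a nonzero entry in column c.
        pivot = next((i for i in range(r, len(rows)) if rows[i][c] != 0), None)
        if pivot is None:
            continue  # column c is a free column
        rows[r], rows[pivot] = rows[pivot], rows[r]
        scale = rows[r][c]
        rows[r] = [x / scale for x in rows[r]]
        # Clear column c in every other row.
        for i in range(len(rows)):
            if i != r and rows[i][c] != 0:
                factor = rows[i][c]
                rows[i] = [a - factor * b for a, b in zip(rows[i], rows[r])]
        pivot_cols.append(c)
        r += 1
    # One basis vector per free column.
    basis = []
    for free in range(n_cols):
        if free in pivot_cols:
            continue
        vec = [Fraction(0)] * n_cols
        vec[free] = Fraction(1)
        for row_idx, pc in enumerate(pivot_cols):
            vec[pc] = -rows[row_idx][free]
        basis.append([int(v) if v.denominator == 1 else v for v in vec])
    return basis


# Indexing matrices over the iteration vector (i, j, k), assumed for
# C[i][j] += A[i][k] * B[k][j]. Each row selects one array subscript.
INDEX = {
    "A": [[1, 0, 0], [0, 0, 1]],  # A[i][k]
    "B": [[0, 0, 1], [0, 1, 0]],  # B[k][j]
    "C": [[1, 0, 0], [0, 1, 0]],  # C[i][j]
}

for name, m in INDEX.items():
    print(name, null_space(m))
```

Under these assumptions, the null space of A's indexing matrix is the j direction, B's is the i direction, and C's is the k direction: iterations that differ only along that direction touch the same element of the array.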
Similar resources
SNAP (Small-World Network Analysis and Partitioning) Framework
Discussion Both LAPACK and ScaLAPACK libraries contain routines for solving systems of linear equations, least squares problems, and eigenvalue problems. The goals of both projects are efficiency (to run as fast as possible), scalability (as the problem size and number of processors grow), reliability (including error bounds), portability (across all important parallel machines), flexibility (s...
Communication-Minimal Partitioning and Data Alignment for Affine Nested Loops
Data alignment and computation domain partitioning techniques have been widely investigated to reduce communication overheads in distributed memory systems. This paper considers how these two techniques can be combined and applied to affine nested loops. Current data alignment techniques focus on individual entries of data arrays and, in general, cannot be used directly for cases when blocks of ...
Linear Algebra Research on the AP
This paper gives a report on various results of the Linear Algebra Project on the Fujitsu AP1000 in 1993. These include the general implementation of Distributed BLAS Level 3 subroutines (for the scattered storage scheme). The performance and user interface issues of the implementation are discussed. Implementations of Distributed BLAS-based LU Decomposition, Cholesky Factorization and Star Pro...
Load Balancing Problem for Parallel Computers with Distributed Memory
This paper deals with load balancing of parallel algorithms for distributed-memory computers. The parallel versions of BLAS subroutines for matrix-vector product and LU factorization are considered. Two task partitioning algorithms are investigated and speed-ups are calculated. The cases of homogeneous and heterogeneous collections of computers/processors are studied, and special partitioning al...
Assessment of the Performance of Clustering Algorithms in the Extraction of Similar Trajectories
In recent years, the tremendous and increasing growth of spatial trajectory data and the necessity of processing and extraction of useful information and meaningful patterns have led to the fact that many researchers have been attracted to the field of spatio-temporal trajectory clustering. The process and analysis of these trajectories have resulted in the extraction of useful information whic...
Publication date: 2007